As updated by GN25-0493
VM/370 Release 6 PLC 4 


MULTIPLE SERVICE RECORD FILE (SRF) SUPPORT

New: Device and Program Support

VM/370 supports multiple SRFs in certain
303x attached processor environments.
The existence of multiple SRFs allows CP
to retrieve MCH and CCH frames from each
SRF device and record them in the error
recording area. Interrupt handler
routines identify whether the main
processor or the attached processor
generated particular MCH and CCH
records.

CHANNEL-SET SWITCHING FACILITY

New: Hardware Feature and Program Support

A Channel-set Switching facility exists 
in certain 303x attached processor 
environments. This facility allows CP 
to switch all active channels on the 
main processor to the attached processor 
when an uncorrectable error occurs on 
the main processor in problem program 
state. 
3203 MODEL 5 PRINTER SUPPORT

New: Hardware Support

VM/370 now supports the 3203 Model 5
printer for use in hardcopying errors
encountered during diagnostic testing of
the system.

Changed: Documentation only

• Figure 30 has been amended to include
further documentation of error record
types recorded by DOS, DOS/VS,
OS/VS1, OS/VS2, and VM/370.

• The default for the ACC= operand of
the CPEREP command in Figure 31 has
been corrected from NO to YES.

• An expanded description of the
function of the CLEARF operand of
the CPEREP command has been added to
Figure 32.

• The term "error recording
cylinder(s)" has been changed to
"error recording area(s)" where
applicable in the text.

• Minor editorial changes have been
made.
Summary of Amendments
as updated by GN25-0476

VM/370 Release 5 PLC 12 


ENSURING VM/370 CONTROL PROGRAM HAS ACCESS 
TO SRF DEVICE 


Changed: Documentation only

Added to the discussion of the SRF 
device as it relates to VM/370 is the 
means of ensuring that the VM/370 
control program has access to the SRF. 
Also documented are the steps necessary 
to activate the SRF device. 
RDEVBLOK to determine if intensive recording mode is in effect for this
device. If the conditions are met, an I/O error record is created. 
This record is constructed and recorded as described previously. 
Control is returned to the I/O supervisor, which reflects the error to 
the user of the I/O operation. 
Figure 12. Control Block Relationship for SDR Counter Update

I/O Error Recording and Error Recording Area


The error recording facilities of VM/370 format and record outboard
error records, and record formatted machine check and channel check
records created by the RMS routines of VM/370.

The error recording routines of VM/370 do not actually perform I/O
operations. Instead, the I/O error routines treat the error recording
area allocated on the VM/370 system residence pack as a logical
extension of VM/370 storage. This extension takes the form of logical
pages that the paging supervisor of VM/370 reads and writes. The error
recording routines place multiple error records within a page; when an
error record is assembled within a page, a pointer is updated to
indicate the beginning of any unused area. The next error record is
checked to see if it fits in the remainder of the page. If it does, the
error record is moved into the page and the pointer is updated again to
reflect the storage remaining for the next error record. This process
continues until an error record is encountered that does not fit within
the page. When this happens, the page is scheduled to be written out to
the next available slot in the error recording area, and a new page in
storage is assigned to accept and retain the error record. The process
then repeats.
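
The page-filling scheme described above can be sketched as follows. This
is a minimal illustration in Python, not the CP implementation: the
4096-byte default page size, the ErrorPageBuffer class, and its method
names are assumptions introduced here, and the write to the error
recording area is simulated with an in-memory list.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ErrorPageBuffer:
    """Hypothetical model of packing error records into logical pages."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.page = bytearray()    # page currently held in storage
        self.written_pages = []    # stands in for slots in the recording area

    @property
    def next_free(self):
        # Pointer to the beginning of the unused area in the current page.
        return len(self.page)

    def add_record(self, record):
        if len(record) > self.page_size:
            raise ValueError("record is larger than a page")
        # If the record does not fit in the remainder of the page,
        # schedule the page to be written and start a new one.
        if self.next_free + len(record) > self.page_size:
            self.flush()
        self.page += record

    def flush(self):
        # Write the current page to the next available slot (simulated).
        if self.page:
            self.written_pages.append(bytes(self.page))
            self.page = bytearray()
```

A record that does not fit never splits across pages in this sketch; it
simply becomes the first record of the next page.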

The error recording area is from two to nine adjacent cylinders 
assigned on the system residence volume. The starting cylinder number 
and number of cylinders are specified in VM/370 generation procedures. 
When the error recording area is 90% full, and again when 100% full, the 
I/O error routines instruct the VM/370 system operator to invoke the 
CPEREP command to print (or create a tape of) the error data and erase 
the recording area. Errors are recorded in the order of occurrence 
until the allotted space is exhausted. 
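
The two operator prompts can be pictured as threshold crossings. The
sketch below is illustrative only: the function name, the slot-count
arguments, and the message wording are assumptions made here and are not
the actual CP messages.

```python
def recording_area_warnings(used_before, used_after, capacity):
    """Return the prompts triggered when usage grows from used_before
    to used_after (all three values in the same unit, e.g. page slots)."""
    warnings = []
    for percent in (90, 100):
        # Integer arithmetic: the threshold is crossed when usage moves
        # from below percent% of capacity to at or above it.
        if used_before * 100 < capacity * percent <= used_after * 100:
            warnings.append(
                f"error recording area {percent}% full; run CPEREP"
            )
    return warnings
```

Each prompt is produced once, at the moment the threshold is crossed,
which matches the "at 90% and again at 100%" wording of the text.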

Because of the support provided for the 303x processors in 
uniprocessor or attached processor modes, CPEREP processing is not 
dependent on the content or engineering change (EC) level of the 
processor logouts to format machine check and channel check records. 
Instead, the 7443 Service Record File (SRF) device provides format and 
content information contained in frames on diskette to format MCH and 
CCH records. In a 303x attached processor environment, each processor 
has its own SRF device. Customer engineering maintains the SRF frames 
(records containing text and scan buffer codes to format MCH and CCH 
records) on each SRF device. CPEREP makes use of these frames to 
interpret and format inboard errors for hardcopy output. 

At initialization, the VM/370 system control program recognizes the 
presence of multiple SRF devices in certain 303x attached processor 
environments. CP accesses the SRF device(s) at initialization,
retrieves the frames, and records them at the beginning of the error 
recording area. When multiple SRF devices exist in a 303x AP 
environment, the header portion of each SRF frame record written to the 
error recording area identifies the processor by processor number and 
model number. The interrupt handler routine identifies which MCH and 
CCH records the main processor generated and which records the attached 
processor generated. In this way, CPEREP uses SRF frames to format MCH 
and CCH records for printed reports by matching the inboard error 
records to their respective frames. 
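
The matching of inboard error records to SRF frames can be sketched as a
lookup keyed by processor identity. This is a hedged illustration: the
ProcessorId type, the dictionary layout of an error record, and both
function names are hypothetical and do not reflect the real record
formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessorId:
    number: int   # processor number from the frame record header
    model: str    # model number from the frame record header


def index_frames(frame_records):
    """Group SRF frames by the processor named in each record header.

    frame_records is an iterable of (ProcessorId, frame) pairs.
    """
    frames = {}
    for header, frame in frame_records:
        frames.setdefault(header, []).append(frame)
    return frames


def frames_for(error_record, frames_by_processor):
    """Return the frames of the processor that generated the record."""
    return frames_by_processor.get(error_record["processor"], [])
```

For example, an MCH record tagged with the attached processor's
identity would be formatted only with frames read from that processor's
SRF device.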

Each time an engineering change (EC) requiring a new diskette is
installed in a 303x uniprocessor or in certain 303x attached processor
environments, the privilege class F user must issue the CPEREP CLEARF
command. This command clears and reformats the error recording area by
accessing the format information in the SRF frames on the newly
installed diskette.

IBM VM/370 OLTSEP & Error Recording Guide

Page of GC20-1809-7 As Updated Aug. 1, 1979 by TNL GN25-0493

In 303x uniprocessor mode or in certain 303x attached processor
environments, system generation procedures provide support for the SRF
device(s) so that CPEREP can properly format machine check and channel
check records created by each processor. A channel path must also exist
between the main processor and the SRF of the attached processor in a
303x attached processor environment. Establishing this channel path
allows CP to read frames from each of the SRF devices to the error
recording area. Refer to VM/370 Planning and System Generation Guide
for the requirements needed to generate support for the SRF device(s).

The SRF device is accessed by VM/370 to read frame data (a) during
VM/370 system initialization if the error recording cylinders have not
been previously formatted; and (b) as a result of running CPEREP with
the CLEARF operand. To ensure that the VM/370 control program has
access to the SRF device after initialization, the following steps
should be followed to activate the SRF:

1. Check that the I/O interface for the service support console is 
enabled. 

2. Obtain the configuration frame on the service support console. 

3. The SRF appears disabled until accessed on the 3032. Activate the 
SRF on the 3031 and 3033 by selecting SRF mode A2. 


Section 3. Error Handling 48.1 



If CCH determines that system integrity has been damaged (for 
example, if the channel has been reset, or if the device address stored 
is invalid), CCH places the system in a disabled wait state and sends a 
message to the VM/370 primary system operator. For the 4331 and 4341
processors, limited channel logout is still available, but no fixed or 
I/O extended logout area exists. 


HANDLING OF HARD MACHINE CHECKS 

If a permanent error (hard machine check) occurs on the main processor
(or attached processor), the error is analyzed to determine whether or
not it is correctable by programming. Time-of-day clock and timer errors
that result in a machine check interruption, are not correctable, and
cannot be circumvented place the real computing system in a disabled
wait state.

Uncorrectable or unretryable processor errors, storage errors, and 
storage protect key failures are handled as discussed in the following 
paragraphs. 


Processor Errors 

When a machine check interruption indicates that a processor error 
associated with VM/370 cannot be corrected or retried, the system
operator is informed of the error and the system is put in a disabled 
wait state. All virtual machine users must log on again. If the error 
is associated with a virtual machine, the user is informed of the error 
and the virtual machine is reset, unless it is using the virtual=real 
option. In that case, the virtual machine is terminated, and the user 
must then log on and reinitialize (via IPL) his machine. 

If VM/370 is being run in attached processor mode and an 
uncorrectable error is encountered on the attached processor while 
executing in problem program state, system operation continues in 
uniprocessor mode on the main processor. 

In certain 303x attached processor environments, a Channel-set
Switching facility may exist. This facility allows processing to
continue on the attached processor in uniprocessor mode after the main
processor enters a disabled wait state following a hard machine check or
channel check that results in an uncorrectable error. Automatic
processor recovery routines test for the Channel-set Switching facility.
If the facility is present, CP switches all active channels on the main
processor to the attached processor, and the processing continues on the
attached processor in uniprocessor mode. Refer to VM/370 Planning and
System Generation Guide for the specific 303x attached processors that
support Channel-set Switching.


Storage Errors in a Virtual Machine Page

When the control program (CP) detects a permanent storage error (hard 
machine check) in a real storage page frame that is being used by a 
virtual machine, the frame is marked invalid if the error is 
intermittent, or unavailable if the error is solid. If the page frame 
has not been altered by the virtual machine, a new page frame is 
assigned to the virtual machine and a backup copy of the page is brought 
in the next time the page is referenced. All storage errors are 
transparent to the virtual machine user. 
If the page frame has been altered, VM/370 resets the virtual
machine, clears its virtual storage to zeros, and sends an appropriate 
message to the user. If the virtual machine is using the virtual=real 
option, it is terminated. In either case, normal system operation 
continues for all other users. 
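
The handling of a storage error in a virtual machine page reduces to a
small decision table, sketched below. The function and its string
results are illustrative assumptions; in particular, the sketch assumes
that termination of a virtual=real machine applies to the altered-page
case, which is how the preceding paragraph reads.

```python
def handle_vm_storage_error(solid, page_altered, virtual_equals_real):
    """Return (frame_state, vm_action) for a hard storage error
    in a page frame used by a virtual machine (illustrative only)."""
    # Intermittent errors mark the frame invalid; solid errors make
    # it unavailable.
    frame_state = "unavailable" if solid else "invalid"

    if not page_altered:
        # A backup copy exists, so the error is hidden from the user.
        vm_action = "assign new frame; page in backup copy on next reference"
    elif virtual_equals_real:
        vm_action = "terminate virtual machine"
    else:
        vm_action = "reset virtual machine; clear virtual storage; notify user"
    return frame_state, vm_action
```

In every branch the other virtual machines are unaffected, which is the
point of isolating the failure to a single page frame.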


Storage Errors in the CP Nucleus

Storage errors in the CP nucleus cannot be corrected; they cause VM/370
to terminate. (Single-bit storage errors are corrected by ECC, as noted
above.)

Storage Protect Key Failures


When intermittent storage protect key failures occur, whether associated 
with VM/370 or a virtual machine, the key is corrected and operation 
continues. 

If the storage protect key error is uncorrectable (solid) and is 
associated with a virtual machine, the user is notified and the virtual 
machine is terminated. The page frame is marked unavailable. 
Uncorrectable storage protect key failures associated with VM/370 cause
the VM/370 system to be terminated. An automatic restart reinitializes 
VM/370. 


HANDLING OF SOFT MACHINE CHECKS 


Although hard machine checks always cause a machine check interruption
to occur and logouts to be written, soft machine checks are handled in
one of two operating modes: recording mode or quiet mode.

• In recording mode, soft machine checks cause machine check
interruptions and write logouts.

• In quiet mode, only hard machine checks cause machine check
interruptions and write logouts.

The normal operating state of VM/370 for CPU retry reporting is
recording mode. For ECC (error checking and correction) reporting, the
initialized (normal) state of VM/370 is model-dependent: quiet mode for
all VM/370-supported processors except Models 155II and 165II. The
initial state for the 155II and 165II is recording mode.


A change from recording mode to quiet mode can occur in one of two 
ways: when 12 soft machine checks have occurred, or when the SET MODE 
RETRY/MAIN QUIET command is executed by maintenance personnel. 


To revert to recording mode, the command SET MODE RETRY/MAIN RECORD
must be issued.


In attached processor applications, soft error recording can be set 
or reset for the selected processor if so desired. 

If a soft machine check (a transient error) occurs while the system 
is in recording mode, a machine check record containing information 
about the error is written on the error recording cylinders. This 
record includes the data in the fixed logout area, the date, the time of 
day, and other pertinent data. The operator is not informed that a soft 
machine check has occurred. 
If a transient error occurs while the system is in quiet mode, no
machine check interruption occurs, and no logouts are written. The 
hardware, which had gained control when the soft machine check occurred, 
returns control to either VM/370 or the problem program, depending on 
which had control at the time the machine check occurred. 
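
The interplay of recording mode, quiet mode, and the limit of 12 soft
machine checks can be modeled as a small state machine. The class below
is a sketch under stated assumptions: the class and method names are
hypothetical, and resetting the counter when the mode is set explicitly
is an assumption that the text does not confirm.

```python
class SoftCheckMode:
    """Illustrative model of recording mode versus quiet mode."""

    LIMIT = 12  # soft machine checks that trigger the switch to quiet mode

    def __init__(self):
        self.mode = "recording"
        self.count = 0
        self.records = []   # stands in for the error recording area

    def soft_machine_check(self, info):
        """Return True if the soft check caused an interruption and record."""
        if self.mode == "quiet":
            # No machine check interruption and no logout in quiet mode.
            return False
        self.records.append(info)
        self.count += 1
        if self.count >= self.LIMIT:
            self.mode = "quiet"
        return True

    def set_mode(self, mode):
        """Model of the SET MODE command issued by maintenance personnel."""
        if mode not in ("recording", "quiet"):
            raise ValueError("mode must be 'recording' or 'quiet'")
        self.mode = mode
        self.count = 0      # assumption: an explicit change restarts the count
```

After the twelfth recorded soft check the model falls silent until
set_mode("recording") is called, mirroring the SET MODE ... RECORD step
described earlier.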
machine check interruption, the processor logs out fields of information
in main storage detailing the cause and nature of the error. The model 
independent data is stored in the fixed logout area and the model 
dependent data is stored in the extended logout area. The machine check 
handler uses these fields to analyze the error and to produce the error 
report.

If the machine fails to recover from the error through its own 
recovery facilities, a machine check interruption occurs, and the fixed 
logout contains an interruption code that indicates the recovery attempt 
was unsuccessful. The machine check handler then analyzes the data and 
attempts to keep the system as fully operational as possible. The cause 
of the malfunction determines what actions MCH takes: 

• Resume operations leaving no adverse effects on the system. 

• Resume system operations by terminating the user that was 
interrupted.

• Isolate the failure to a page and flag the page as invalid or 
unavailable for use by the paging supervisor. 

• Place the system in a disabled wait state. 

• In VM/370 attached processor operations, processing may continue in 
uniprocessor mode if the attached processor malfunctions while in 
problem program state and recovery is not possible. 

• In certain 303x attached processor environments, processing may 
continue in uniprocessor mode on the attached processor by the 
Channel-set Switching facility. If this facility is present, CP 
switches all active channels on the main processor to the attached 
processor if the main processor malfunctions while in the problem 
program state and recovery is not possible. 

Note: Loss of system integrity prevents the recording of hard machine 
checks in the supervisor (CP). Error information of this type may be
obtained through the use of the processor's hard stop facility if the 
machine check is repetitive. 

LEVELS OF ERROR RECOVERY 

Recovery from machine malfunctions can be divided into the following 
categories: functional recovery, system recovery, operator-initiated 
restart, and system repair. These levels of error recovery are 
discussed in order from the easiest type of recovery to the most 
difficult. 

Functional Recovery

Functional recovery is recovery from a machine check without adverse 
effect on the system or the interrupted user. This type of recovery can 
be made by either the processor retry or the ECC facility, or the 
machine check handler. The processor retry and ECC error correcting 
facilities are discussed separately in this section since they are 
significant in the total error recovery scheme. Functional recovery by 
the machine check handler is made by correcting Storage Protect Feature 
(SPF) keys and intermittent errors in main storage. 
System Recovery

System recovery is attempted when functional recovery is impossible.
System recovery is the continuation of system operations by terminating
the user who experienced the error. System recovery can take place only
if the user in question is not critical to continued system operation.
A system routine containing an error that is considered to be critical
to system operation precludes functional recovery and would require

Operator Initiated Restart

When the errors may have caused a loss of supervisor or system 
integrity, the system is put into a disabled wait state. The operator 
must then reload the system. 

System Repair

If system recovery is not possible, the system may require the services
of maintenance personnel to effect a system hardware repair. System 
repair by this method occurs when the error is so critical to system 
operations that the system cannot be used to record the error. 

MACHINE CHECK HANDLER — SUMMARY 

The machine check handler (MCH) consists of entirely resident routines 
in the CP nucleus. 

Recovery from most machine malfunctions on System/370 is initially 
attempted by the instruction retry, and the error checking and 
correction (ECC) machine facilities. However, if the retry or storage 
correction is unsuccessful, if the interrupted instruction is 
non-retryable, or if the storage failure cannot be repaired, RMS will 
assess the damage and do the following: 

• If the fault is an SPF key failure, refresh the key if conditions 
warrant such action. 

• If the fault is related to main storage, either (1) refresh that page 
or (2) have CP flag that page as unusable and assign a new page; then 
refresh the data if valid to do so. 

• Terminate or reset the virtual machine if the malfunction cannot be 
repaired but is traceable to a particular virtual machine. 

• Terminate all SCP operations and post a wait state code if system 
integrity is lost and nonrecoverable. 

• In attached processor applications, if the malfunction is associated 
with the attached processor while running in problem program state 
and attached processor recovery is not possible, cease all operations 
on the attached processor and allow the system to continue in 
uniprocessor mode on the main processor. 

• If the error is a channel group inoperative on a 3031, 3032 or 3033 
processor, place the system in a disabled wait state. 
Any of the above conditions can produce one or more of the following
results:

• Wherever possible, a record of the error is produced in the
system's error recording area.

• Wherever possible, the primary system operator is informed of the
error.

Errors corrected by instruction retry and main storage errors
corrected by ECC are not reflected to the system operator's console, and
these errors may or may not be recorded. See "Recovery Modes" in this
section for a discussion of this.

The messages produced by the machine check handler on supported 
VM/370 systems are described in VM/370 System Messages. Wait state
codes 001 and 013 produced by the machine check handler routines are 
also described in VM/370 System Messages. 
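
The summary of machine check handler actions can be restated as a
decision function. This sketch is not the MCH logic itself: the fault
names, the order in which conditions are tested, and the fallback for
faults the summary does not cover are all assumptions made for
illustration.

```python
def mch_action(fault, traceable_to_vm=False, integrity_lost=False):
    """Pick a recovery action for a machine check that hardware retry
    or ECC could not repair (illustrative only)."""
    if integrity_lost or fault == "channel_group_inoperative":
        # Loss of system integrity, or a channel group inoperative
        # on a 3031, 3032, or 3033 processor.
        return "disabled wait state"
    if fault == "spf_key":
        return "refresh key"
    if fault == "main_storage":
        return "refresh page, or flag page unusable and assign new page"
    if fault == "attached_processor_problem_state":
        return "continue in uniprocessor mode on main processor"
    if traceable_to_vm:
        return "terminate or reset virtual machine"
    # Assumption: anything else is treated like a loss of integrity.
    return "disabled wait state"
```

Whatever action is chosen, the text above still applies: wherever
possible, the error is recorded and the primary system operator is
informed.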
Operand    Description

ACC=       Indicates that selected error records are to be accumulated
           in an output data set. The particular error records
           selected and the source of these records (either the VM/370
           error recording cylinders, or a history file, or both)
           depend on what other operands are coded. The output
           accumulation data set is normally a tape mounted on tape
           drive 181, but this can be changed (see the section "CPEREP
           FILEDEFS"). When output is accumulated on tape 181, the
           output is added as an extension of the existing file: the
           tape is rewound and then spaced out to the end of the first
           file prior to writing. Therefore, if a tape is to be used
           for the first time, the user should write a tape mark at
           the beginning of the tape before invoking CPEREP (the CMS
           TAPE command can do this). When output is accumulated on a
           tape, the tape should be mounted, readied, and attached to
           the user's virtual machine as tape 181 prior to invoking
           CPEREP. Note that for most types of reports, ACC=Y is the
           default.

CLEAR
CLEARF

CPU        An error record selection operand. It allows the selection
           of error records by the central processor unit's serial and
           model number. Multiple processor values may be specified
           as multiple sub-operands of CPU.

CPUCUA     An error record selection operand. It allows the selection
           of error records that relate to a specific processor
           (serial address) and an attached device (cuu) address.
           Multiple processors and devices can be specified as multiple
           sub-operands of CPUCUA.

Figure 32. Operands Used with CPEREP (Part 1 of 5)
Operand    Description

CTLCRD     An error record selection operand. When the RDESUM operand
           requests an IPL report, CTLCRD controls the selection of
           error records via its span of dates and IPL clustering
           interval.


Note : T 
title op 
line of 
operands 
DATE=      An error record selection operand. It allows the selection
           of error records by the date or span of dates (Julian day
           values) specified.

DEV=       An error record selection operand. It allows the selection
           of error records by device type (for example, 2314, 3330).
           Multiple device types can be specified as multiple
           suboperands of DEV.

DEVSER=    An error record selection operand. It allows the selection
           of error records by the specific device serial number in
           the service data field in the error record. This operand
           is valid only for 3410/3420 devices. Multiple device
           serial numbers can be specified as multiple suboperands of
           DEVSER.

ERRORID=   An error record selection operand. It applies only to MCH
           and software records generated by OS/VS2 MVS. It allows
           selection by the five-digit error identifier alone or by
           the five-digit error identifier, processor identifier,
           address space identifier, and date/time values.

EVENT=     A report generation operand. This operand generates
           one-line abstracts of all or selected error records in
           chronological order.

Figure 32. Operands Used with CPEREP (Part 2 of 5)
Index


The entries in this Index are accumulative. They list additions to this publication by 
the following VM/370 System Control Program Products: 

• VM/370 Basic System Extensions, Program Number 5748-XX8 

• VM/370 System Extensions, Program Number 5748-XE1 

However, the text within the publication is not accumulative; it only relates to the one 
SCP program product that is installed on your system. Therefore, there may be topics and 
references listed in this Index that are not contained in the body of this publication. 
RDEVBLOK to determine if intensive recording mode is in effect for this
device. If the conditions are met, an I/O error record is created. 
This record is constructed and recorded as described previously. 
Control is returned to the I/O supervisor, which reflects the error to 
the user of the I/O operation. 
Figure 12. Control Block Relationship for SDR Counter Update


Page of GC20-1809-7 As Updated Aug. 31, 1979 by TNL SN25-0760 

For 5748-XE1 

I/O Error Recording and Error Recording Area 


The error recording facilities of VM/370 format and record outboard
error records, and record formatted machine check and channel check
records created by the RMS routines of VM/370.

The error recording routines of VM/370 do not actually perform I/O
operations. Instead, the I/O error routines treat the error recording
area allocated on the VM/370 system residence pack as a logical
extension of VM/370 storage. This extension takes the form of logical
pages that the paging supervisor of VM/370 reads and writes. The error
recording routines place multiple error records within a page; when an
error record is assembled within a page, a pointer is updated to
indicate the beginning of any unused area. The next error record is
checked to see if it fits in the remainder of the page. If it does, the
error record is moved into the page and the pointer is updated again to
reflect the storage remaining for the next error record. This process
continues until an error record is encountered that does not fit within
the page. When this happens, the page is scheduled to be written out to
the next available slot in the error recording area, and a new page in
storage is assigned to accept and retain the error record. The process
then repeats.

On count-key-data devices, the error recording area is from two to 
nine adjacent cylinders assigned on the system residence volume. The 
starting cylinder number and number of cylinders are specified in VM/370 
generation procedures. On FB-512 devices the error recording area is 
any number of adjacent pages assigned on the system residence volume. 
The starting page number and the number of pages are specified in the 
VM/370 generation procedures. In either case, when the error recording area
is 90% full, and again when 100% full, the I/O error routines instruct 
the VM/370 system operator to invoke the CPEREP command to print (or 
create a tape of) the error data and erase the recording area. Errors 
are recorded in the order of occurrence until the allotted space is 
exhausted. 

Because of the support provided for the 303x processors in 
uniprocessor or attached processor modes, CPEREP processing is not 
dependent on the content or engineering change (EC) level of the 
processor logouts to format machine check and channel check records. 
Instead, the 7443 Service Record File (SRF) device provides format and 
content information contained in frames on diskette to format MCH and 
CCH records. In a 303x attached processor environment, each processor 
has its own SRF device. Customer engineering maintains the SRF frames 
(records containing text and scan buffer codes to format MCH and CCH 
records) on each SRF device. CPEREP makes use of these frames to 
interpret and format inboard errors for hardcopy output. 

At initialization, the VM/370 system control program recognizes the 
presence of multiple SRF devices in certain 303x attached processor 
environments. CP accesses the SRF device(s) at initialization,
retrieves the frames, and records them at the beginning of the error 
recording area. When multiple SRF devices exist in a 303x AP 
environment, the header portion of each SRF frame record written to the 
error recording area identifies the processor by processor number and 
model number. The interrupt handler routine identifies which MCH and 
CCH records the main processor generated and which records the attached 
processor generated. In this way, CPEREP uses SRF frames to format MCH 
and CCH records for printed reports by matching the inboard error 
records to their respective frames. 
Each time an engineering change (EC) requiring a new diskette is
installed in a 303x uniprocessor or in certain 303x attached processor
environments, the privilege class F user must issue the CPEREP CLEARF
command. This command clears and reformats the error recording area by
accessing the format information in the SRF frames on the newly
installed diskette.

In 303x uniprocessor mode or in certain 303x attached processor 
environments, system generation procedures provide support for the SRF 
device(s) so that CPEREP can properly format machine check and channel
check records created by each processor. A channel path must also exist 
between the main processor and the SRF of the attached processor in a 
303x attached processor environment. Establishing this channel path 
allows CP to read frames from each of the SRF devices to the error 
recording area. Refer to VM/370 Planning and System Generation Guide
for the requirements needed to generate support for the SRF device(s).

The SRF device is accessed by VM/370 to read frame data (a) during 
VM/370 system initialization if the error recording cylinders have not 
been previously formatted; and (b) as a result of running CPEREP with 
the CLEARF operand. To ensure that the VM/370 control program has
access to the SRF device after initialization, the following steps 
should be followed to activate the SRF: 

1. Check that the I/O interface for the service support console is 
enabled. 

2. Obtain the configuration frame on the service support console. 

3. The SRF appears disabled until accessed on the 3032. Activate the 
SRF on the 3031 and 3033 by selecting SRF mode A2. 
If CCH determines that system integrity has been damaged (for
example, if the channel has been reset, or if the device address stored
is invalid), CCH places the system in a disabled wait state and sends a
message to the VM/370 primary system operator. For the 4331 and 4341
processors, limited channel logout is still available, but no fixed or
I/O extended logout area exists.

Virtual machines for which VMSAVE (Directory option or SET command 
operand) is enabled normally have their register and storage contents 
saved in the event of certain abend situations. However, catastrophic 
channel errors cause a disabled wait PSW to be loaded and may prevent 
saving the contents of a virtual machine. 


HANDLING OF HARD MACHINE CHECKS 

If a permanent error (hard machine check) occurs on the main processor 
(or attached processor), the error is analyzed to determine whether or
not it is correctable by programming. Time-of-day clock and timer errors
that cause a machine check interruption, and that can be neither
corrected nor circumvented, place the real computing system in a
disabled wait state.

Uncorrectable or unretryable processor errors, storage errors, and 
storage protect key failures are handled as discussed in the following 
paragraphs. 

Processor Errors

When a machine check interruption indicates that a processor error 
associated with VM/370 cannot be corrected or retried, the system
operator is informed of the error and the system is put in a disabled 
wait state. All virtual machine users must log on again. If the error 
is associated with a virtual machine, the user is informed of the error 
and the virtual machine is reset, unless it is using the virtual=real 
option. In that case, the virtual machine is terminated, and the user 
must then log on and reinitialize (via IPL) his machine. 

If VM/370 is being run in attached processor mode and an 
uncorrectable error is encountered on the attached processor while 
executing in problem program state, system operation continues in 
uniprocessor mode on the main processor. 

In certain 303x attached processor environments, a Channel-set 
Switching facility may exist. This facility allows processing to 
continue on the attached processor in uniprocessor mode after the main 
processor enters a disabled wait state following a hard machine check or 
channel check that results in an uncorrectable error. Automatic 
processor recovery routines test for the Channel-set Switching facility. 
If the facility is present, CP switches all active channels on the main 
processor to the attached processor, and the processing continues on the 
attached processor in uniprocessor mode. Refer to VM/370 Planning and
System Generation Guide for the specific 303x attached processors that
support Channel-set Switching.
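
The processor error outcomes described above can be summarized as a
small decision sketch. It is illustrative only and is not CP code: the
names Owner, processor_error_action, and attached_processor_recovery are
hypothetical, and the result when the main processor fails without
Channel-set Switching is an inference from the text.

```python
from enum import Enum


class Owner(Enum):
    """What the uncorrectable processor error is associated with."""
    CP = "VM/370"
    VIRTUAL_MACHINE = "virtual machine"


def processor_error_action(owner, virtual_equals_real=False):
    """Outcome for a processor error that cannot be corrected or retried."""
    if owner is Owner.CP:
        # All virtual machine users must log on again afterwards.
        return "inform operator; enter disabled wait state"
    if virtual_equals_real:
        # The user must then log on and reinitialize (IPL) the machine.
        return "inform user; terminate virtual machine"
    return "inform user; reset virtual machine"


def attached_processor_recovery(failing, channel_set_switching=False):
    """Outcome in attached processor mode; failing is 'main' or 'attached'."""
    if failing not in ("main", "attached"):
        raise ValueError("failing must be 'main' or 'attached'")
    if failing == "attached":
        return "continue in uniprocessor mode on main processor"
    if channel_set_switching:
        return ("switch active channels to attached processor; "
                "continue in uniprocessor mode on attached processor")
    # Inference (not stated outright in the text): without the facility,
    # the main processor stays in the disabled wait state.
    return "main processor remains in disabled wait state"
```

For example, attached_processor_recovery("main", True) yields the
Channel-set Switching path, while an error on the attached processor
always leaves the main processor running in uniprocessor mode.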

Storage Errors in a Virtual Machine Page

When the control program (CP) detects a permanent storage error (hard 
machine check) in a real storage page frame that is being used by a 
virtual machine, the frame is marked invalid if the error is 
intermittent, or unavailable if the error is solid. If the page frame
has not been altered by the virtual machine, a new page frame is 
assigned to the virtual machine and a backup copy of the page is brought 
in the next time the page is referenced. All storage errors are 
transparent to the virtual machine user. 

If the page frame has been altered, VM/370 resets the virtual 
machine, clears its virtual storage to zeros, and sends an appropriate 
message to the user. If the virtual machine is using the virtual=real 
option, it is terminated. In either case, normal system operation 
continues for all other users.
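
The handling of a permanent storage error in a virtual machine page can
be sketched as follows. This is a minimal illustration of the rules
stated above, not the CP implementation; the function names and the
returned strings are hypothetical, and the virtual=real case is applied
only to altered pages, which is how the text above reads.

```python
def frame_marking(solid):
    """A solid error marks the frame unavailable; an intermittent one, invalid."""
    return "unavailable" if solid else "invalid"


def storage_error_action(solid, page_altered, virtual_equals_real=False):
    """Return (frame marking, action taken) for a hard storage error."""
    marking = frame_marking(solid)
    if not page_altered:
        # An unaltered page still has a valid backup copy.
        action = "assign new page frame; bring in backup copy on next reference"
    elif virtual_equals_real:
        action = "terminate virtual machine"
    else:
        action = ("reset virtual machine; clear virtual storage to zeros; "
                  "send message to user")
    return marking, action
```

In every case the action affects only the owning virtual machine;
normal system operation continues for all other users.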


Storage Errors in the CP Nucleus

Multiple-bit storage errors in the CP nucleus cannot be corrected; they 
cause VM/370 to terminate. (Single-bit storage errors are corrected by 
ECC, as noted above.) 


Storage Protect Key Failures

When intermittent storage protect key failures occur, whether associated 
with VM/370 or a virtual machine, the key is corrected and operation
continues. 

If the storage protect key error is uncorrectable (solid) and is 
associated with a virtual machine, the user is notified and the virtual 
machine is terminated. The page frame is marked unavailable. 
Uncorrectable storage protect key failures associated with VM/370 cause 
the VM/370 system to be terminated. An automatic restart reinitializes 
VM/370. 
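
The storage protect key cases above form a small lookup table, sketched
here for illustration. The table name, the key strings, and the function
key_failure_action are hypothetical; only the outcomes come from the
text.

```python
# (kind of failure, what it is associated with) -> outcome described above
KEY_FAILURE_ACTIONS = {
    ("intermittent", "VM/370"): "correct key; continue operation",
    ("intermittent", "virtual machine"): "correct key; continue operation",
    ("solid", "virtual machine"):
        "notify user; terminate virtual machine; mark page frame unavailable",
    ("solid", "VM/370"):
        "terminate VM/370; automatic restart reinitializes VM/370",
}


def key_failure_action(kind, owner):
    """Look up the outcome for a storage protect key failure."""
    try:
        return KEY_FAILURE_ACTIONS[(kind, owner)]
    except KeyError:
        raise ValueError(f"unknown combination: {kind!r}, {owner!r}") from None
```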


HANDLING OF SOFT MACHINE CHECKS 

Although hard machine checks always cause a machine check interruption 
to occur and logouts to be written, soft machine checks are handled in 
one of two operating modes: recording mode or quiet mode.

• In recording mode, soft machine checks cause machine check 
interruptions and write logouts. 

• In quiet mode, only hard machine checks cause machine check
interruptions and write logouts. 

The normal operating state of VM/370 for CPU retry reporting is 
recording mode. For ECC (error checking and correction) reporting, the 
initialized (normal) state of VM/370 is model-dependent: quiet mode for
all VM/370-supported processors except Models 155II and 165II. The
initial state for the 155II and 165II is recording mode.

A change from recording mode to quiet mode can occur in one of two
ways: when 12 soft machine checks have occurred, or when the SET MODE 
RETRY/MAIN QUIET command is executed by maintenance personnel. 

To return to recording mode, the command SET MODE RETRY/MAIN RECORD
must be issued.

In attached processor applications, soft error recording can be set 
or reset for the selected processor if so desired. 
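
The switching between recording mode and quiet mode can be modeled as a
simple state machine. The sketch below is a simplification under stated
assumptions: the class SoftCheckReporting is hypothetical, a single
counter stands in for both retry and main storage reporting, the twelfth
soft machine check is assumed to be recorded before the switch to quiet
mode, and the count is assumed to restart when the mode is set by
command.

```python
class SoftCheckReporting:
    """Toy model of recording mode versus quiet mode for soft machine checks."""

    THRESHOLD = 12  # soft machine checks that force the change to quiet mode

    def __init__(self, mode="RECORD"):
        self.mode = mode
        self.soft_count = 0

    def set_mode(self, mode):
        """Models the effect of SET MODE RETRY/MAIN QUIET or RECORD."""
        if mode not in ("RECORD", "QUIET"):
            raise ValueError("mode must be 'RECORD' or 'QUIET'")
        self.mode = mode
        self.soft_count = 0  # assumption: the count restarts on a mode change

    def soft_machine_check(self):
        """Return True if the soft check causes an interruption and logout."""
        if self.mode == "QUIET":
            return False  # only hard machine checks are reported
        self.soft_count += 1
        if self.soft_count >= self.THRESHOLD:
            self.mode = "QUIET"  # automatic change after 12 soft checks
        return True
```

In this model, twelve consecutive calls to soft_machine_check() are
recorded, later calls are not, and recording resumes only after
set_mode("RECORD").
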

If a soft machine check (a transient error) occurs while the system
is in recording mode, a machine check record containing information
about the error is written in the error recording area. This
record includes the data in the fixed logout area, the date, the time of 
day, and other pertinent data. The operator is not informed that a soft 
machine check has occurred. 

If a transient error occurs while the system is in quiet mode, no 
machine check interruption occurs, and no logouts are written. The 
hardware, which had gained control when the soft machine check occurred, 
returns control to either VM/370 or the problem program, depending on 
which had control at the time the machine check occurred. 